Course: VISUAL ANALYTICS FOR POLICY AND MANAGEMENT

Prof. José Manuel Magallanes, PhD

  • Visiting Professor of Computational Policy at Evans School of Public Policy and Governance, and eScience Institute Senior Data Science Fellow, University of Washington.
  • Professor of Government and Political Methodology, Pontificia Universidad Católica del Perú.

Tabular data - Bivariate relationships I: Categorical-Categorical

We analyze two variables to find out whether there is some kind of association between them. Even though a clear association may be difficult to identify, bivariate analysis still helps reveal signs of one that may at least raise concern.

Let me use the data about crime from the Seattle Open Data portal (I have formatted this file previously).

# clear memory
rm(list = ls())
# collecting the data
link="https://github.com/EvansDataScience/data/raw/master/crime.RData"
load(file = url(link)) #loaded as 'crime'

Let’s see what kind of data we have in the crime data table:

# categorical? numerical?
str(crime,width = 50,strict.width='cut')
## 'data.frame':    499698 obs. of  17 variables:
##  $ Report.Number              : chr  "201300002"..
##  $ Occurred.Date              : Date, format:  ...
##  $ year                       : num  2013 2013 2..
##  $ month                      : num  7 7 7 7 7 7..
##  $ weekday                    : Ord.factor w/ 7"..
##  $ Occurred.Time              : num  1930 1917 1..
##  $ Occurred.DayTime           : Ord.factor w/ 4"..
##  $ Reported.Date              : Date, format:  ...
##  $ Reported.Time              : num  1722 2052 3..
##  $ DaysToReport               : num  1 0 1 1 0 0..
##  $ crimecat                   : Factor w/ 20 le"..
##  $ Crime.Subcategory          : Factor w/ 30 le"..
##  $ Primary.Offense.Description: Factor w/ 144 l"..
##  $ Precinct                   : Factor w/ 5 lev"..
##  $ Sector                     : Factor w/ 23 le"..
##  $ Beat                       : Factor w/ 64 le"..
##  $ Neighborhood               : Factor w/ 58 le"..

Categorical-Categorical relationships

While we use frequency tables in the univariate case, for the bivariate case (cat-cat) we prepare contingency tables. Let’s select a couple of categorical variables:

This contingency table shows counts for each combination of levels:

(PrecinctDaytime=table(crime$Precinct,crime$Occurred.DayTime))
##            
##               day afternoon evening night
##   EAST      15976     20774   17380 19880
##   NORTH     33744     48754   39867 37942
##   SOUTH     17322     22147   16240 15497
##   SOUTHWEST 10595     14221   11169 11034
##   WEST      30366     48931   33766 30925

When a table is meant to hypothesize a relationship, you should have the independent variable in the columns, and the dependent one in the rows. Interpretation is difficult with raw counts, so it is better to use percentages. Percentages should be computed by column, so that you can see how the levels of the dependent variable vary within each level of the independent one (reading along rows):

# computing column percent from contingency table
library(magrittr) # for %>%
(PrecDayti_mgCol=prop.table(PrecinctDaytime,
                            margin = 2) #2 means by column
                             %>%round(.,3))
##            
##               day afternoon evening night
##   EAST      0.148     0.134   0.147 0.172
##   NORTH     0.312     0.315   0.337 0.329
##   SOUTH     0.160     0.143   0.137 0.134
##   SOUTHWEST 0.098     0.092   0.094 0.096
##   WEST      0.281     0.316   0.285 0.268

The previous table shows you how the share of crimes each precinct receives changes with the time of day. So you need a plot that highlights those differences across time.

As before, we need to turn this table into a data frame:

#making a data frame from contingency table
(PrecDaytiDF=as.data.frame(PrecinctDaytime))
##         Var1      Var2  Freq
## 1       EAST       day 15976
## 2      NORTH       day 33744
## 3      SOUTH       day 17322
## 4  SOUTHWEST       day 10595
## 5       WEST       day 30366
## 6       EAST afternoon 20774
## 7      NORTH afternoon 48754
## 8      SOUTH afternoon 22147
## 9  SOUTHWEST afternoon 14221
## 10      WEST afternoon 48931
## 11      EAST   evening 17380
## 12     NORTH   evening 39867
## 13     SOUTH   evening 16240
## 14 SOUTHWEST   evening 11169
## 15      WEST   evening 33766
## 16      EAST     night 19880
## 17     NORTH     night 37942
## 18     SOUTH     night 15497
## 19 SOUTHWEST     night 11034
## 20      WEST     night 30925

We also have the table with marginal percents by column:

as.data.frame(PrecDayti_mgCol)
##         Var1      Var2  Freq
## 1       EAST       day 0.148
## 2      NORTH       day 0.312
## 3      SOUTH       day 0.160
## 4  SOUTHWEST       day 0.098
## 5       WEST       day 0.281
## 6       EAST afternoon 0.134
## 7      NORTH afternoon 0.315
## 8      SOUTH afternoon 0.143
## 9  SOUTHWEST afternoon 0.092
## 10      WEST afternoon 0.316
## 11      EAST   evening 0.147
## 12     NORTH   evening 0.337
## 13     SOUTH   evening 0.137
## 14 SOUTHWEST   evening 0.094
## 15      WEST   evening 0.285
## 16      EAST     night 0.172
## 17     NORTH     night 0.329
## 18     SOUTH     night 0.134
## 19 SOUTHWEST     night 0.096
## 20      WEST     night 0.268

We should simply add the last column to the data frame of counts.

PrecDaytiDF$share=as.data.frame(PrecDayti_mgCol)[,3]
PrecDaytiDF
##         Var1      Var2  Freq share
## 1       EAST       day 15976 0.148
## 2      NORTH       day 33744 0.312
## 3      SOUTH       day 17322 0.160
## 4  SOUTHWEST       day 10595 0.098
## 5       WEST       day 30366 0.281
## 6       EAST afternoon 20774 0.134
## 7      NORTH afternoon 48754 0.315
## 8      SOUTH afternoon 22147 0.143
## 9  SOUTHWEST afternoon 14221 0.092
## 10      WEST afternoon 48931 0.316
## 11      EAST   evening 17380 0.147
## 12     NORTH   evening 39867 0.337
## 13     SOUTH   evening 16240 0.137
## 14 SOUTHWEST   evening 11169 0.094
## 15      WEST   evening 33766 0.285
## 16      EAST     night 19880 0.172
## 17     NORTH     night 37942 0.329
## 18     SOUTH     night 15497 0.134
## 19 SOUTHWEST     night 11034 0.096
## 20      WEST     night 30925 0.268

We can change the names of the previous data frame:

names(PrecDaytiDF)[1:3]=c("precinct","daytime","counts")

#then
PrecDaytiDF
##     precinct   daytime counts share
## 1       EAST       day  15976 0.148
## 2      NORTH       day  33744 0.312
## 3      SOUTH       day  17322 0.160
## 4  SOUTHWEST       day  10595 0.098
## 5       WEST       day  30366 0.281
## 6       EAST afternoon  20774 0.134
## 7      NORTH afternoon  48754 0.315
## 8      SOUTH afternoon  22147 0.143
## 9  SOUTHWEST afternoon  14221 0.092
## 10      WEST afternoon  48931 0.316
## 11      EAST   evening  17380 0.147
## 12     NORTH   evening  39867 0.337
## 13     SOUTH   evening  16240 0.137
## 14 SOUTHWEST   evening  11169 0.094
## 15      WEST   evening  33766 0.285
## 16      EAST     night  19880 0.172
## 17     NORTH     night  37942 0.329
## 18     SOUTH     night  15497 0.134
## 19 SOUTHWEST     night  11034 0.096
## 20      WEST     night  30925 0.268
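As an aside, the margin argument of prop.table() is easy to get backwards. A minimal self-contained sketch (synthetic counts, not the crime data): margin = 2 divides by column totals, margin = 1 by row totals:

```r
# synthetic 2x2 contingency table (hypothetical counts)
toy = matrix(c(10, 30, 20, 40), nrow = 2,
             dimnames = list(c("A", "B"), c("yes", "no")))
prop.table(toy, margin = 2)  # column percents: each column sums to 1
prop.table(toy, margin = 1)  # row percents: each row sums to 1
```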

We will use ggplot to represent the contingency table:

library(ggplot2)
base1=ggplot(data=PrecDaytiDF, 
             aes(x=daytime,
                 y=share,
                 fill=precinct)) # fill brings a legend

Then, you can play with the bar positions. First, the dodge style:

barDodge= base1 +  geom_bar(stat="identity",
                            position ='dodge') 
barDodge 

The second is the stack style:

barStacked = base1 + geom_bar(stat = "identity",
                              position = 'stack')#default
barStacked 

The stacked version will help more than the dodged one, as it better reveals the values in the contingency table:

PrecDayti_mgCol
##            
##               day afternoon evening night
##   EAST      0.148     0.134   0.147 0.172
##   NORTH     0.312     0.315   0.337 0.329
##   SOUTH     0.160     0.143   0.137 0.134
##   SOUTHWEST 0.098     0.092   0.094 0.096
##   WEST      0.281     0.316   0.285 0.268

So, let’s continue by adding some elements to this one:

library(scales)
#annotating
barStackedAnn= barStacked + geom_text(size = 5,# check below:
                             position = position_stack(vjust = 0.5),# center
                             aes(label=percent(share,accuracy = 0.1)))# percent format

barStackedAnn = barStackedAnn + scale_y_continuous(labels = percent)

barStackedAnn

Since precinct is nominal, and you see some marked differences among the percentages, you can reorder the precinct blocks with reorder():

base1=ggplot(data=PrecDaytiDF, 
             aes(x=daytime, y=share,
                 fill=reorder(precinct,share))) ## reordering

barStacked = base1 + geom_bar(stat = "identity",
                              position = 'stack')
barStacked= barStacked + geom_text(size = 5,
                             position = position_stack(vjust = 0.5),
                             aes(label=percent(share,accuracy = 0.1)))

barStacked = barStacked + scale_y_continuous(labels = percent)

barStacked
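Note that reorder() sorts the factor levels by a second variable, using its mean per level by default. A quick self-contained check with made-up values:

```r
f = factor(c("EAST", "WEST", "NORTH"))
v = c(3, 1, 2)
levels(reorder(f, v))  # "WEST" "NORTH" "EAST": levels now ascend by v
```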

Let me show you a more complex situation:

# contingency table with many levels:

(CrimeDay=table(crime$crimecat,crime$Occurred.DayTime))
##                            
##                               day afternoon evening night
##   AGGRAVATED ASSAULT         3564      5366    4884  7501
##   ARSON                       196       167     191   486
##   BURGLARY                  24139     22288   14121 16082
##   CAR PROWL                 26740     38273   42595 34839
##   DISORDERLY CONDUCT           41        81      67    79
##   DUI                         706       939    2038  8522
##   FAMILY OFFENSE-NONVIOLENT  1748      2516    1217  1120
##   GAMBLE                        4         4       7     2
##   HOMICIDE                     41        46      49   131
##   LIQUOR LAW VIOLATION        112       491     410   606
##   LOITERING                    20        31      25     9
##   NARCOTIC                   2415      6416    3924  4109
##   PORNOGRAPHY                  65        53      17    31
##   PROSTITUTION                115       675    1425  1340
##   RAPE                        332       318     354   854
##   ROBBERY                    2584      4737    4139  5372
##   SEX OFFENSE-OTHER          1501      1759    1014  1776
##   THEFT                     38687     64868   38980 28410
##   TRESPASS                   4848      5184    2598  3289
##   WEAPON                      735      1445     947  1624

This contingency table has a categorical variable with many levels; let’s prepare a data frame as before:

#making a data frame from contingency table
CrimeDayDF=as.data.frame(CrimeDay)
#marginal
CrimeDay_mgCol=prop.table(CrimeDay,margin = 2)
#renaming:
names(CrimeDayDF)=c("crime","daytime","counts")
#adding marginal
CrimeDayDF$share=as.data.frame(CrimeDay_mgCol)[,3]
# result for ggplot:
head(CrimeDayDF,20)
##                        crime daytime counts        share
## 1         AGGRAVATED ASSAULT     day   3564 3.281980e-02
## 2                      ARSON     day    196 1.804905e-03
## 3                   BURGLARY     day  24139 2.222887e-01
## 4                  CAR PROWL     day  26740 2.462405e-01
## 5         DISORDERLY CONDUCT     day     41 3.775566e-04
## 6                        DUI     day    706 6.501340e-03
## 7  FAMILY OFFENSE-NONVIOLENT     day   1748 1.609680e-02
## 8                     GAMBLE     day      4 3.683479e-05
## 9                   HOMICIDE     day     41 3.775566e-04
## 10      LIQUOR LAW VIOLATION     day    112 1.031374e-03
## 11                 LOITERING     day     20 1.841739e-04
## 12                  NARCOTIC     day   2415 2.223900e-02
## 13               PORNOGRAPHY     day     65 5.985653e-04
## 14              PROSTITUTION     day    115 1.059000e-03
## 15                      RAPE     day    332 3.057287e-03
## 16                   ROBBERY     day   2584 2.379527e-02
## 17         SEX OFFENSE-OTHER     day   1501 1.382225e-02
## 18                     THEFT     day  38687 3.562568e-01
## 19                  TRESPASS     day   4848 4.464376e-02
## 20                    WEAPON     day    735 6.768392e-03

Sometimes a simple contingency table reveals salient relationships without being plotted, but in this case a visual may be needed.

As before, let’s request a stacked barplot:

# bad idea
base2=ggplot(data=CrimeDayDF,
             aes(x=daytime,y=share,fill=crime))

base2=base2 + geom_bar(stat = "identity", position = 'fill') + 
        geom_text(size = 3, 
                  position = position_stack(vjust = 0.5),
                  aes(label=percent(share,accuracy = 0.1)))

barStacked2 = base2 + scale_y_continuous(labels = percent)

barStacked2

This plot will need a lot of work, so using it may not be a good strategy.

A first option would be to use a barplot with facets and dodged bars. Let’s make a first attempt.

# base with only X and Y in 'aes()'
baseBar = ggplot(CrimeDayDF, aes(x = crime, y = share ) ) 

#the bars
barPlot  = baseBar + geom_bar( stat = "identity" ) 

barPlot

Now see the facets:

# bar per day time with 'facet'
barsFt = barPlot + facet_grid(~ daytime) 

barsFt

This does not look like the crosstable yet; let’s solve that:

barsFt + coord_flip()

The type of crime is not ordinal, so we can reorder the bars:

# new base
baseRE  = ggplot(CrimeDayDF, 
                 aes(x = reorder(crime, share), #here
                     y = share ) ) + theme_minimal()

barPlotRE = baseRE + geom_bar( stat = "identity" ) 
barFtRE = barPlotRE + facet_grid( ~ daytime) 
barFtRE= barFtRE + coord_flip() 


barFtRE

Let’s work on the crime labels:

barFtRE=barFtRE + theme(axis.text.y = element_text(size=7,angle = 20)) 
barFtRE

We could annotate the bars:

barREann= barFtRE+ geom_text(aes(label=round(share,2)),
                             nudge_y = 0.1)
barREann

Let’s annotate conditionally instead:

barCond=barFtRE + geom_text(aes(label=ifelse(share>0.1,# condition to annotate
                                      round(share,2),"")),
                     nudge_y = 0.1)
barCond

What about percents instead:

barFtRE + geom_text(aes(label=ifelse(share>0.1,
                                      percent(share,accuracy = 1),# %
                                     "")),
                     nudge_y = 0.1,size=3) + 
           scale_y_continuous(labels = percent_format(accuracy = 1,suffix="")) #%
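The conditional annotation above combines two pieces worth checking on their own: the vectorized ifelse() and percent() from scales. A small sketch with toy shares:

```r
library(scales)
shares = c(0.356, 0.046, 0.007)
# label only shares above 10%, leave the rest blank
ifelse(shares > 0.1, percent(shares, accuracy = 0.1), "")
# "35.6%" "" ""
```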

Tabular data - Bivariate relationships II: Categorical-Numerical

Let’s keep using the same data on crimes. The next cases will show how a categorical variable can help us better understand the behavior of a numeric variable.

Let’s take a look at a variable that records the number of days it takes someone to report a crime:

# stats of days to report
# notice the spread of values.
summary(crime$DaysToReport)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##     0.00     0.00     0.00     7.65     1.00 36525.00        2

The max is a very high value. Let me see the crimes that took the longest:

crime[which.max(crime$DaysToReport),]
##       Report.Number Occurred.Date year month weekday Occurred.Time
## 6783 20080000465209    1908-12-13 1908    12  Sunday          2114
##      Occurred.DayTime Reported.Date Reported.Time DaysToReport crimecat
## 6783          evening    2008-12-13          2114        36525      DUI
##      Crime.Subcategory Primary.Offense.Description Precinct Sector Beat
## 6783               DUI                  DUI-LIQUOR     EAST      G   G2
##                  Neighborhood
## 6783 CENTRAL AREA/SQUIRE PARK

Do you think this is right? This looks like a typo: the Reported.Date is the same day, exactly 100 years later. Let’s alter the Occurred.Date value.

crime[6783,'Occurred.Date']=crime[6783,'Reported.Date']

We also need to recompute the DaysToReport value:

crime[6783,'DaysToReport']=difftime(crime[6783,'Reported.Date'],
                                    crime[6783,'Occurred.Date'],
                                    units = "days")%>%as.numeric()
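That difftime() arithmetic can be verified on its own: a century between those two dates is exactly 36525 days (100 years of 365 days plus 25 leap days), which matches the suspicious maximum we saw above:

```r
library(magrittr)
difftime(as.Date("2008-12-13"), as.Date("1908-12-13"),
         units = "days") %>% as.numeric()  # 36525
```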

The weekday and the year may need to be updated:

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
crime[6783,'weekday']=wday(crime[6783,'Occurred.Date'], 
                           label=TRUE,
                           abbr = FALSE)
crime[6783,'year']=year(crime[6783,'Occurred.Date'])
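A quick standalone check of those lubridate helpers on the corrected date:

```r
library(lubridate)
d = as.Date("2008-12-13")
year(d)                              # 2008
wday(d, label = TRUE, abbr = FALSE)  # Saturday (with English day labels)
```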

Let’s again use the category Precinct with the numerical DaysToReport. This time, let’s keep only the complete (non-missing) rows:

crime_nona=crime[complete.cases(crime),]
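complete.cases() returns TRUE for the rows with no missing values in any column; a tiny sketch:

```r
toy = data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
complete.cases(toy)        # TRUE FALSE FALSE
toy[complete.cases(toy), ] # keeps only the first row
```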

Let me get the median for each precinct:

# summary: median by groups
aggregate(data=crime_nona, DaysToReport~Precinct,median)
##    Precinct DaysToReport
## 1      EAST            0
## 2     NORTH            0
## 3     SOUTH            0
## 4 SOUTHWEST            0
## 5      WEST            0

As you see, 50% of the cases are reported the same day. Let’s request a boxplot for each precinct:

# boxplot of days to report per precinct

library(ggplot2)
base=ggplot(data=crime_nona,
            aes(x=Precinct,y=DaysToReport))

base + geom_boxplot()

The plot above does not give much insight; there is too much noise. Let’s check other statistics besides the median:

# using "summary" function
tapply(crime_nona$DaysToReport,
       crime_nona$Precinct, summary)
## $EAST
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     0.000     0.000     0.000     6.726     1.000 10622.000 
## 
## $NORTH
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     0.000     0.000     0.000     7.104     1.000 14268.000 
## 
## $SOUTH
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.00     0.00     7.22     1.00 11292.00 
## 
## $SOUTHWEST
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.00     0.00    11.07     1.00 11992.00 
## 
## $WEST
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     0.000     0.000     0.000     4.899     1.000 16801.000

From the information above, you know that in every precinct, 75% of crimes are reported within a day. If we take that as the expected behavior, let me keep the crimes that took at least a week, a month, and a year to report, and compare them with ggarrange:

library(ggpubr)

baseWeek=ggplot(data=crime_nona[crime_nona$DaysToReport>=7,],
            aes(x=Precinct,y=DaysToReport)) 
boxWeek=baseWeek + geom_boxplot() + labs(title = "week and above")

baseMonth=ggplot(data=crime_nona[crime_nona$DaysToReport>=30,],
            aes(x=Precinct,y=DaysToReport))
boxMonth=baseMonth + geom_boxplot() + labs(title = "month and above")


baseYear=ggplot(data=crime_nona[crime_nona$DaysToReport>=365,],
            aes(x=Precinct,y=DaysToReport)) 
boxYear=baseYear + geom_boxplot() + labs(title = "year and above")



#all in one:
ggarrange(boxWeek,boxMonth,boxYear,ncol = 1)

Up to this point, you should be planning a good story. The situation differs in each case, but let’s build our visual from the crimes that take a year or longer to report.

crimeYear=crime_nona[crime_nona$DaysToReport>=365,]

Let me see if flipping helps you see better:

titleText="Crimes that took longer than one year to report"

baseYear=ggplot(data=crimeYear,
            aes(x=Precinct,
                y=DaysToReport)) 
boxYear=baseYear + geom_boxplot() + 
        labs(title = titleText)
# flipping
boxYear  + coord_flip()

If we are showing durations above a year, we might change the unit from days to years:

crimeYear$YearsToReport=crimeYear$DaysToReport/365

Let’s redo the previous boxplot, reordering the category by the median of the numeric variable:

baseYear=ggplot(data=crimeYear,
            aes(x=reorder(Precinct,
                          YearsToReport,
                          median),
                y=YearsToReport)) 
boxYear=baseYear + geom_boxplot() + labs(title =titleText)
# flipping
boxYear  + coord_flip()

What if we use the histogram:

baseHY=ggplot(data=crimeYear,
            aes(x=YearsToReport)) 
histHY=baseHY + geom_histogram(aes(fill=Precinct), 
                              color='black') #color the border
histHY  
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You need facets:

histHY + facet_grid(~Precinct)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The alternative without legend:

histHY + facet_grid(Precinct~.) + guides(fill="none")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What about reordering:

histHYre= histHY + facet_grid(reorder(Precinct,
                                  -DaysToReport,
                                  median)~.) + guides(fill="none")
histHYre
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Another common visual is the mean-error plot, which shows the mean of the numeric variable including a confidence interval. Let me first recall the two variables I have been using:

crimeYear[,c('Precinct', 'YearsToReport')] %>%head(20)
##       Precinct YearsToReport
## 1885     NORTH      1.000000
## 2122 SOUTHWEST      1.002740
## 2886      WEST      3.569863
## 2901      WEST      3.005479
## 2977     NORTH      2.704110
## 3029     SOUTH      1.301370
## 4317     NORTH      5.005479
## 4502      WEST      1.005479
## 5366 SOUTHWEST      1.991781
## 5848      EAST      1.000000
## 6053      EAST      1.024658
## 6084      EAST      4.380822
## 7019     NORTH      1.731507
## 7020     NORTH      2.583562
## 7914      EAST      1.002740
## 7995      WEST     46.030137
## 7996     NORTH     39.090411
## 8003 SOUTHWEST     31.668493
## 8005      WEST     31.221918
## 8007 SOUTHWEST     32.854795

The plan is to show the mean per precinct:

library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:ggpubr':
## 
##     mutate
Rmisc::group.CI(data=crimeYear,
          YearsToReport~Precinct)
##    Precinct YearsToReport.upper YearsToReport.mean YearsToReport.lower
## 1      EAST            3.387550           2.916302            2.445055
## 2     NORTH            3.533030           3.219861            2.906691
## 3     SOUTH            4.203968           3.717830            3.231692
## 4 SOUTHWEST            5.012880           4.335640            3.658401
## 5      WEST            3.075696           2.668819            2.261941
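group.CI() reports, per group, the mean with a 95% confidence interval based on the t distribution. A hedged sketch of the same computation by hand, on made-up values:

```r
x   = c(1, 2, 3, 4, 5)          # hypothetical years-to-report
m   = mean(x)
se  = sd(x) / sqrt(length(x))   # standard error of the mean
moe = qt(0.975, df = length(x) - 1) * se
c(lower = m - moe, mean = m, upper = m + moe)
```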

Let’s plot those group means:

baseMEANs=ggplot(crimeYear, aes(x=Precinct,
                             y=YearsToReport)) +
        theme_classic()
pointMEANS=baseMEANs + stat_summary(fun = mean, 
                                    geom = "point")
pointMEANS 

We can add now the error bar:

pointErrors=pointMEANS + stat_summary(fun.data = mean_ci,
                                      geom = "errorbar") 
pointErrors

Error bars have a huge problem: they give you the illusion of symmetry. So, I recommend including the data in the plot:

BarJit=pointErrors + geom_jitter(colour="blue",
                             alpha=0.2 #transparency
                             )
BarJit

Some might prefer a logarithmic scale on the vertical axis:

BarJit + scale_y_log10(breaks=c(1,1.5,3,10,25,50)) + geom_hline(yintercept = 50,linetype='dashed')

Tabular data - Bivariate relationships III: Numerical-Numerical

Numeric-Time data

Let me use one of the date variables in the crime data set (the one with no missing values):

summary(crime_nona$Occurred.Date)
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "1964-06-15" "2010-10-29" "2013-10-18" "2013-08-10" "2016-05-11" "2018-11-06"

A date will repeat whenever several crimes occur on the same day, so we should prepare a frequency table of those dates:

crimeDate=as.data.frame(table(crime_nona$Occurred.Date))  # date will be a factor
head(crimeDate,10)
##          Var1 Freq
## 1  1964-06-15    1
## 2  1973-01-01    1
## 3  1975-12-16    1
## 4  1978-01-01    1
## 5  1979-01-28    1
## 6  1979-07-04    1
## 7  1980-01-01    1
## 8  1981-02-14    1
## 9  1981-08-22    1
## 10 1985-09-13    1

The column with dates became a factor when we computed the frequency table. Let’s change the factor to date, and also rename the columns:

names(crimeDate)=c("date",'count')
#formatting column in Freq Table:
crimeDate$date=as.Date(crimeDate$date)

So now you have:

head(crimeDate)
##         date count
## 1 1964-06-15     1
## 2 1973-01-01     1
## 3 1975-12-16     1
## 4 1978-01-01     1
## 5 1979-01-28     1
## 6 1979-07-04     1

Let’s show the line plot:

base=ggplot(crimeDate,
            aes(x=date,y=count))
base  + geom_line(alpha=0.3) 

Let’s zoom-in to dates starting in 2010:

start <- as.Date("2010-1-1")
end <- NA
base=ggplot(crimeDate,
            aes(x=date,y=count))
base  + geom_line(alpha=0.3) + scale_x_date(limits = c(start, end)) 
## Warning: Removed 1174 row(s) containing missing values (geom_path).

Once we have daily counts, we can use more of lubridate, like aggregating by month:

base=ggplot(crimeDate,
            aes(x=floor_date(date, "month"),
                y=count))
monthly= base  + geom_line(alpha=0.3) 
monthly= monthly + scale_x_date(limits = c(start, end))
monthly
## Warning: Removed 1174 row(s) containing missing values (geom_path).
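floor_date() snaps every date down to the start of its period, so all days of the same month share one x position on the plot. A quick check:

```r
library(lubridate)
floor_date(as.Date("2010-03-17"), unit = "month")  # "2010-03-01"
```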

We could also add a trend:

# adding a trend:
monthlyTrend = monthly + stat_smooth(color = "red",
                      method = "loess")
monthlyTrend
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1174 rows containing non-finite values (stat_smooth).
## Warning: Removed 1174 row(s) containing missing values (geom_path).

What about faceting by crime? Our crimeDate data frame does not have that information, so let’s redo it:

crimeDate2=table(crime_nona$Occurred.Date,crime_nona$crimecat)%>%
                as.data.frame()  # date will be a factor
head(crimeDate2,10)
##          Var1               Var2 Freq
## 1  1964-06-15 AGGRAVATED ASSAULT    0
## 2  1973-01-01 AGGRAVATED ASSAULT    0
## 3  1975-12-16 AGGRAVATED ASSAULT    0
## 4  1978-01-01 AGGRAVATED ASSAULT    0
## 5  1979-01-28 AGGRAVATED ASSAULT    0
## 6  1979-07-04 AGGRAVATED ASSAULT    0
## 7  1980-01-01 AGGRAVATED ASSAULT    0
## 8  1981-02-14 AGGRAVATED ASSAULT    0
## 9  1981-08-22 AGGRAVATED ASSAULT    0
## 10 1985-09-13 AGGRAVATED ASSAULT    0

The column of dates appears as a factor; let’s reformat crimeDate2:

names(crimeDate2)=c("date","crime",'count')
#formatting column in Freq Table:
crimeDate2$date=as.Date(crimeDate2$date)

#then
head(crimeDate2,10)
##          date              crime count
## 1  1964-06-15 AGGRAVATED ASSAULT     0
## 2  1973-01-01 AGGRAVATED ASSAULT     0
## 3  1975-12-16 AGGRAVATED ASSAULT     0
## 4  1978-01-01 AGGRAVATED ASSAULT     0
## 5  1979-01-28 AGGRAVATED ASSAULT     0
## 6  1979-07-04 AGGRAVATED ASSAULT     0
## 7  1980-01-01 AGGRAVATED ASSAULT     0
## 8  1981-02-14 AGGRAVATED ASSAULT     0
## 9  1981-08-22 AGGRAVATED ASSAULT     0
## 10 1985-09-13 AGGRAVATED ASSAULT     0

Let’s keep focusing on the period from 2010 onward:

base=ggplot(crimeDate2,
            aes(x=floor_date(date, "month"),
                y=count))
monthly= base  + geom_line(alpha=0.3) 
monthly= monthly + scale_x_date(limits = c(start, end))

# adding a trend:
monthly = monthly + stat_smooth(color = "red",
                      method = "loess")
monthly + facet_wrap(~crime)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 23480 rows containing non-finite values (stat_smooth).
## Warning: Removed 1174 row(s) containing missing values (geom_path).

Alternatively,

monthly + facet_wrap(~reorder(crime,-count))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 23480 rows containing non-finite values (stat_smooth).
## Warning: Removed 1174 row(s) containing missing values (geom_path).

We just reorganized the previous plot so that we highlight the most and least common crimes along that time period.

So far, lines have been used to report counts of crimes. We can also analyze the distribution of those counts using histograms:

crime2010since=crime[crime$Occurred.Date>"2010-1-1",]
crimeTotalCountsDay=as.data.frame(table(crime2010since$Occurred.Date))
crimeTotalCountsDay$Var1=as.Date(crimeTotalCountsDay$Var1)
names(crimeTotalCountsDay)=c('date','counts')
ggplot(data=crimeTotalCountsDay, aes(x=counts)) + geom_histogram() + xlab("crime per day")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The plot above shows a distribution of crimes per day since 2010. Check this summary:

summary(crimeTotalCountsDay$counts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    43.0   114.0   127.0   127.1   140.0   199.0

This tells you that there was a day with 199 crimes:

# checking the original data:
sort(table(crime2010since$Occurred.Date),decreasing = T)[1:3]
## 
## 2017-07-01 2017-05-26 2016-01-20 
##        199        192        186

From the summary you also know that, since 2010, even the quietest day had 43 crimes, and the mean is about 127 crimes per day. Let’s see the distribution per year:

tapply(crimeTotalCountsDay$counts,
       year(crimeTotalCountsDay$date), FUN=summary)
## $`2010`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    59.0   107.0   119.0   118.6   129.0   171.0 
## 
## $`2011`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    58.0   100.0   113.0   113.1   124.0   169.0 
## 
## $`2012`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    43.0   102.2   112.0   112.0   122.0   164.0 
## 
## $`2013`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    82.0   112.0   124.0   124.8   137.0   174.0 
## 
## $`2014`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    77.0   122.0   136.0   135.1   147.0   181.0 
## 
## $`2015`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    89.0   119.0   130.0   130.6   142.0   184.0 
## 
## $`2016`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    82.0   124.2   134.0   134.4   145.0   186.0 
## 
## $`2017`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    82.0   127.0   138.0   137.8   148.0   199.0 
## 
## $`2018`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    48.0   129.2   139.0   139.1   148.8   183.0

To plot this information, we need to add a year column to the frequency table crimeTotalCountsDay:

crimeTotalCountsDay$year=year(crimeTotalCountsDay$date)
#you have
head(crimeTotalCountsDay,15)
##          date counts year
## 1  2010-01-02    128 2010
## 2  2010-01-03    128 2010
## 3  2010-01-04    135 2010
## 4  2010-01-05    136 2010
## 5  2010-01-06    133 2010
## 6  2010-01-07    141 2010
## 7  2010-01-08    171 2010
## 8  2010-01-09    151 2010
## 9  2010-01-10    116 2010
## 10 2010-01-11    127 2010
## 11 2010-01-12    147 2010
## 12 2010-01-13    119 2010
## 13 2010-01-14    149 2010
## 14 2010-01-15    158 2010
## 15 2010-01-16    141 2010

Now, you can plot by year:

base = ggplot(crimeTotalCountsDay,
       aes(x = counts)) 
HistSimple=base + geom_histogram(fill='grey', color=NA) 
HistFacet=HistSimple+ facet_wrap(~year, # the new year column
                                ncol = 1, #all in one column
                                strip.position = 'right')
HistFacet
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In general, it would look better if we use densities:

base = ggplot(crimeTotalCountsDay,
       aes(x = counts)) + theme_classic()
densePlot=base + geom_density(fill='grey', color=NA) 
denseFacet=densePlot+ facet_wrap(~year, # the new year column
                                ncol = 1, #all in one column
                                strip.position = 'right')
denseFacet

You can improve the faceted plot with:

denseFacet + 
        # reduce space between density plot
  theme(panel.spacing.y = unit(0.1, "lines"),
        # no title on y
        axis.title.y = element_blank(),
        # no text on y
        axis.text.y = element_blank(),
        # no line on y
        axis.line.y = element_blank(),
        # no ticks on y
        axis.ticks.y = element_blank(),
        # the border and background of each year in facet:
        strip.background = element_rect(colour="white"),
        # the text of each year in facet
        strip.text.y = element_text(size=12,
                                    color="grey",
                                    angle = 0))
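If the ggridges package is installed (an assumption; it is not loaded in this material), a ridgeline plot condenses the same year-by-year comparison into a single panel:

```r
library(ggridges) # assumption: installed separately
ggplot(crimeTotalCountsDay,
       aes(x = counts, y = as.factor(year))) +
  geom_density_ridges(fill = 'grey', color = NA) +
  theme_classic()
```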

We can also use plots similar to the ones in the previous material (cat-num). Let’s keep durations longer than a year, from the year 2000 onward:

# new filtered data frame
crimeY2000=crime_nona[crime_nona$year>=2000 & crime_nona$DaysToReport>=365,]

# create new variable in YEARS:
crimeY2000$YearsToReport=crimeY2000$DaysToReport/365
#boxplot by Year
base=ggplot(data = crimeY2000,
            aes(x=as.factor(year),
                y=YearsToReport)) # is not ONE value
boxByYear=base + geom_boxplot()

boxByYear
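The boxplot above can be cross-checked numerically; a sketch using tapply on the filtered data:

```r
# median years to report, per year of occurrence
tapply(crimeY2000$YearsToReport,
       crimeY2000$year,
       median)
```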

Remember that although the boxplot is very informative, I recommend using familiar statistics:

# vector of colors named as stats:
myFunCols=c(Max='black',Median='purple',Min='blue')

# plotting
baseINV=ggplot(crimeY2000, aes(x=as.factor(year),
                             y=YearsToReport)) +
        theme_classic()
# points with AESthetics:
MAXs=baseINV + stat_summary(fun = max, geom = "point", 
                            aes(color='Max'))
 
MAXMINs=MAXs + stat_summary(fun = min, geom = "point", 
                            aes(color='Min'))

MxMnMDs=MAXMINs+ stat_summary(fun = median, geom = "point", 
                            aes(color='Median'))

MxMnMDs= MxMnMDs + scale_colour_manual(values=myFunCols,
                                       name="Stats") 
MxMnMDs

Notice that if we want to connect the points, this may not be what you have in mind:

MxMnMDs + geom_line()

Remember that YearsToReport is NOT one value per year, but several. Connecting the dots therefore requires overriding the default grouping, so we add group=1:

MxMnMDsL=MxMnMDs+ geom_line(stat="summary",fun='max',
                            aes(color="Max"),
                            group=1) +
                  geom_line(stat="summary",fun='min',
                            aes(color="Min"),
                            group=1) +
                  geom_line(stat="summary",fun='median',
                            aes(color="Median"),
                            group=1) 
MxMnMDsL + theme(axis.text.x = element_text(angle = 60,
                                            vjust = 0.5
                                            ))

Numeric-Numeric data

The study of bivariate relationships among numerical variables is known as correlation analysis. The data we have been using has few numerical columns, but we will produce two by aggregating the original data since 2015 by Neighborhood:

  • Aggregating days to report and neighborhood:
crime2015=crime_nona[crime_nona$year>=2015,]
# 1. MEAN of days it takes to report a crime by neighborhood
daysByNeigh=aggregate(data=crime2015,DaysToReport~Neighborhood,mean)

# you have:
head(daysByNeigh)
##      Neighborhood DaysToReport
## 1 ALASKA JUNCTION     3.624267
## 2            ALKI     3.922434
## 3   BALLARD NORTH     4.248983
## 4   BALLARD SOUTH     3.844756
## 5        BELLTOWN     2.497066
## 6      BITTERLAKE     4.383847
  • Aggregating crimes by neighborhood
# 2. Crimes by neighborhood
crimesByNeigh=as.data.frame(100*prop.table(table(crime2015$Neighborhood)))
names(crimesByNeigh)=c('Neighborhood', 'CrimeShare')
head(crimesByNeigh)
##      Neighborhood CrimeShare
## 1 ALASKA JUNCTION  1.5336142
## 2            ALKI  0.4430089
## 3   BALLARD NORTH  2.2081719
## 4   BALLARD SOUTH  3.3167513
## 5        BELLTOWN  2.7928590
## 6      BITTERLAKE  1.8523903

Since both data frames have the same neighborhoods, we can merge them into a single data frame:

num_num=merge(daysByNeigh,crimesByNeigh,by='Neighborhood')
str(num_num)
## 'data.frame':    58 obs. of  3 variables:
##  $ Neighborhood: Factor w/ 58 levels "ALASKA JUNCTION",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ DaysToReport: num  3.62 3.92 4.25 3.84 2.5 ...
##  $ CrimeShare  : num  1.534 0.443 2.208 3.317 2.793 ...

Let’s turn the Neighborhood into characters:

num_num$Neighborhood=as.character(num_num$Neighborhood)

Once we have the data organized, the usual option is the scatterplot:

base = ggplot(num_num, aes(x=DaysToReport,y=CrimeShare)) 
plot1= base +  geom_point() 
plot1
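A quick way to see whether a linear trend is plausible is to overlay a fitted line (a sketch; se = TRUE adds the confidence band):

```r
plot1 + geom_smooth(method = 'lm', se = TRUE)
```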

If you compute the Pearson correlation coefficient, you may not find a relevant correlation:

cor.test(num_num$DaysToReport,num_num$CrimeShare,method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  num_num$DaysToReport and num_num$CrimeShare
## t = -1.8145, df = 56, p-value = 0.07496
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.46559934  0.02412146
## sample estimates:
##        cor 
## -0.2356425
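Since the Pearson coefficient is sensitive to outliers, a rank-based alternative is worth checking; a sketch using the same cor.test() call with Spearman’s method:

```r
# rank-based correlation; robust to outlying neighborhoods
cor.test(num_num$DaysToReport,
         num_num$CrimeShare,
         method = "spearman")
```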

However, you can visually find something relevant. Let’s use ggrepel to show labels:

library(ggrepel)
plot1 + geom_text_repel(aes(label=Neighborhood))
## Warning: ggrepel: 44 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

plot1 + geom_text_repel(aes(label=Neighborhood),size=2,
                        min.segment.length = 0,
                        max.overlaps = 100)

Now we can limit the labels, annotating the ones that represent at least 5% of the crimes in the city:

plot1 + geom_text_repel(aes(label=ifelse(CrimeShare>=5,Neighborhood, "")),size=2,
                        min.segment.length = 0,
                        max.overlaps = 100)

Or the ones that take longer than a week to report:

plot1 + geom_text_repel(aes(label=ifelse(DaysToReport>7,Neighborhood, "")),
                        size=2,
                        min.segment.length = 0,
                        max.overlaps = 100)
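Both annotation rules can be combined into a single condition with the | operator (a sketch):

```r
# label neighborhoods that are salient on either dimension
plot1 + geom_text_repel(aes(label=ifelse(CrimeShare>=5 | DaysToReport>7,
                                         Neighborhood, "")),
                        size=2,
                        min.segment.length = 0,
                        max.overlaps = 100)
```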

Besides conditionally annotating the places, you can identify the area of the most salient behavior. Let’s highlight overlapping points:

library(hexbin)
heat = base +  geom_hex(bins = 10) + 
               scale_fill_distiller(palette ="Greys",
                                    direction=1) 
heat

The palettes can be selected from the ColorBrewer website. Choosing another palette, we can try a different plot (stat_density_2d):

heat2 = base +  stat_density_2d(aes(fill = ..density..), 
                                geom = "raster",
                                contour = FALSE) + 
                scale_fill_distiller(palette="Reds",
                                     direction=1) 
heat2

Now you have an approximate of the places representing the most common behavior:

heat2 + theme_light() + 
    geom_text_repel(aes(label=Neighborhood),size=2,color='black',
                        fontface='bold',
                        min.segment.length = 0,
                        max.overlaps = 100) +  
    scale_x_continuous(expand = c(0, 0),limits = c(2,5)) + 
    scale_y_continuous(expand = c(0, 0),limits = c(0.5,3)) 
## Warning: Removed 27 rows containing non-finite values (stat_density2d).
## Warning: Removed 396 rows containing missing values (geom_raster).
## Warning: Removed 27 rows containing missing values (geom_text_repel).

You can use geom_label_repel in some cases (but here it might not work well).

heat2 + theme_light() + 
    geom_label_repel(aes(label=Neighborhood),size=2,color='black',
                        fontface='bold',
                        min.segment.length = 0,
                        max.overlaps = 100) +  
    scale_x_continuous(expand = c(0, 0),limits = c(2,5)) + 
    scale_y_continuous(expand = c(0, 0),limits = c(0.5,3))
## Warning: Removed 27 rows containing non-finite values (stat_density2d).
## Warning: Removed 396 rows containing missing values (geom_raster).
## Warning: Removed 27 rows containing missing values (geom_label_repel).

Sometimes increasing the size helps:

heat2 + theme_light() + 
    geom_label_repel(aes(label=Neighborhood),
                        size=3, # larger than before
                        color='black',
                        fontface='bold',
                        min.segment.length = 0,
                        max.overlaps = 100) +  
    scale_x_continuous(expand = c(0, 0),limits = c(2,5)) + 
    scale_y_continuous(expand = c(0, 0),limits = c(0.5,3))
## Warning: Removed 27 rows containing non-finite values (stat_density2d).
## Warning: Removed 396 rows containing missing values (geom_raster).
## Warning: Removed 27 rows containing missing values (geom_label_repel).

Some of these plots work better in interactive mode.
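For instance, if the plotly package is available (an assumption; it is not loaded in this material), the scatterplot can be made interactive, with values shown on hover:

```r
library(plotly) # assumption: installed separately
ggplotly(plot1)
```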